Diary 2023-11-27
nishio: First, regarding the claim "everyone posts GPT-generated text on the net, future LLMs will mistake it for human input and learn from it, and therefore performance will degrade": the premise does not hold, because LLM-generated text can be identified and filtered out.

cheedah7427: I think watermarking only means detection is possible to some extent if you want to identify text, not that it is 100% identifiable. And I think contamination will happen eventually, because not all LLMs implement that countermeasure, and we cannot assume a situation where we know that a given sentence in the wild was generated by a particular LLM under known conditions.

cheedah7427: (Here I'm calling the whole sentence-generation system an "LLM" for simplicity.)

nishio: We were talking in the context of "everyone posting ChatGPT-generated sentences on the Internet," and in that case OpenAI can identify that the text is ChatGPT-generated and exclude it from its training data. True, the wide variety of LLMs in the open camp might contaminate each other because they can't identify each other's output. Maybe we need standardization?

cheedah7427: My apologies, I hadn't followed the context. Standardization would be ideal, but I feel it would be difficult in practice... token generation is intertwined with too many different factors!

nishio: No, no, I appreciate your point of view; it has broadened my perspective!

---
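The watermarking idea discussed above can be sketched concretely. The following toy Python code illustrates one published approach (a "green list" soft watermark in the style of Kirchenbauer et al.), not whatever scheme OpenAI may actually use: a secret key and the previous token seed a PRNG that picks a "green" subset of the vocabulary, generation is biased toward green tokens, and a detector with the key computes a z-score on the green-token count. All function and variable names here are illustrative.

```python
import hashlib
import math
import random


def green_set(prev_token: str, vocab: list[str], key: str,
              fraction: float = 0.5) -> set[str]:
    # Seed a PRNG from the secret key and the previous token, then
    # deterministically pick a "green" subset of the vocabulary.
    seed = int(hashlib.sha256((key + prev_token).encode()).hexdigest(), 16)
    rng = random.Random(seed)
    return set(rng.sample(vocab, int(len(vocab) * fraction)))


def generate(n: int, vocab: list[str], key: str) -> list[str]:
    # Toy "hard" watermarked generator: always emit a token from the
    # green set of the previous token (a real LLM would only bias logits).
    out = ["<s>"]
    for _ in range(n):
        greens = sorted(green_set(out[-1], vocab, key))
        out.append(random.choice(greens))
    return out[1:]


def z_score(tokens: list[str], vocab: list[str], key: str,
            fraction: float = 0.5) -> float:
    # Detector: count tokens that fall in the green set determined by
    # their predecessor. Unwatermarked text hits the green set at roughly
    # the chance rate `fraction`, so its z-score stays near zero.
    prev, hits = "<s>", 0
    for t in tokens:
        if t in green_set(prev, vocab, key):
            hits += 1
        prev = t
    n = len(tokens)
    return (hits - fraction * n) / math.sqrt(n * fraction * (1 - fraction))
```

Note how this captures both sides of the conversation: detection works only for the party holding the key, and text from an LLM that never applied the watermark (or used a different key) scores near zero, so a filter cannot flag it.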
This page is auto-translated from /nishio/日記2023-11-27 using DeepL. If you find something interesting but the auto-translated English is not good enough to understand it, feel free to let me know at @nishio_en. I'm very happy to spread my thoughts to non-Japanese readers.